Data science is a huge field, it invovles applications and tools like importing, tidying, transformation, visualization, modeling, and communication. Surrounding all these is programming.
Source: R for Data Science
randomdatetimejsonrequestsnumpypandasmatplotlibscikit-learnscipytensorflowpytorchflaskdashcsvjsonnumpypandasrequestsbeautifulsoup4lxmlsqlite3numpypandasnumpypandaspandasmatplotlibseabornscikit-learnscipytensorflowpytorchjupyterdashIt should be possible to explain the laws of physics to a barmaid.
Source: Einstein: His Life and Times
We brief the what and why on pandas and a quickstart of common DataFrame operations with 91APP dataset.
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.
Python has always been a popular language for web scraping, however there is a weak spot in its analysis capability. Python used to switch to a more data-centric language like R or Matlab during the analysis stage.
Index, ndarray, Series, DataFrame¶DataFrameSeriesIndexndarrayimport numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sqlite3
Index and a ndarray can form a Series¶movie_index = pd.Index(['The Avengers', 'Avengers: Age of Ultron', 'Avengers: Infinity War', 'Avengers: Endgame'])
movie_ratings = np.array([8, 7.3, 8.5, 8.4])
print(type(movie_index))
print(type(movie_ratings))
<class 'pandas.core.indexes.base.Index'> <class 'numpy.ndarray'>
rating_series = pd.Series(movie_ratings, movie_index)
print(type(rating_series))
print(rating_series)
<class 'pandas.core.series.Series'> The Avengers 8.0 Avengers: Age of Ultron 7.3 Avengers: Infinity War 8.5 Avengers: Endgame 8.4 dtype: float64
rating_series = pd.Series([8, 7.3, 8.5, 8.4], ['The Avengers', 'Avengers: Age of Ultron', 'Avengers: Infinity War', 'Avengers: Endgame'])
print(type(rating_series))
print(type(rating_series.index))
print(type(rating_series.values))
<class 'pandas.core.series.Series'> <class 'pandas.core.indexes.base.Index'> <class 'numpy.ndarray'>
Series with the same Index can form a DataFrame¶release_year_series = pd.Series([2012, 2015, 2018, 2019], ['The Avengers', 'Avengers: Age of Ultron', 'Avengers: Infinity War', 'Avengers: Endgame'])
movie_df = pd.DataFrame()
movie_df['rating'] = rating_series
movie_df['release_year'] = release_year_series
print(type(movie_df))
movie_df
<class 'pandas.core.frame.DataFrame'>
| rating | release_year | |
|---|---|---|
| The Avengers | 8.0 | 2012 |
| Avengers: Age of Ultron | 7.3 | 2015 |
| Avengers: Infinity War | 8.5 | 2018 |
| Avengers: Endgame | 8.4 | 2019 |
print(type(movie_df['rating']))
print(type(movie_df['rating'].index))
<class 'pandas.core.series.Series'> <class 'pandas.core.indexes.base.Index'>
Source: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
Source: NTU資料分析與決策社
!file --mime-encoding *.csv
!file --mime-encoding *.txt
BehaviorDataForNTU.csv: utf-8 MemberDataForNTU.csv: us-ascii OrderDataForNTU.csv: us-ascii OrderSlaveDataForNTU.txt: us-ascii
read_csv functions¶# for example
order_slave = pd.read_csv('OrderSlaveDataForNTU.txt')
/Users/kuoyaojen/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3063: DtypeWarning: Columns (5,6) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
order_slave.shape
(1159830, 15)
order_slave.dtypes
UUID object TransactionNum object TradesDateTime object ChannelType object ChannelDetail object PaymentType object ShippingType object OuterProductSkuCode object ProductSkuCode float64 SalePageCode float64 Qty int64 TotalSalesAmount int64 TotalPrice int64 TotalDiscount int64 Status object dtype: object
dtypes = {
'PaymentType': 'str',
'ShippingType': 'str'
}
order_slave = pd.read_csv('OrderSlaveDataForNTU.txt', dtype=dtypes)
DtypeWarning?¶Pandas is built on numpy, numpy records missing values with np.NaN which is a float type.
print(type(np.NaN))
<class 'float'>
dtypes = {
'CategoryId': 'str',
'SearchKeyWord': 'str',
'TransactionNum': 'str'
}
behavior = pd.read_csv('BehaviorDataForNTU.csv', dtype=dtypes)
member = pd.read_csv('MemberDataForNTU.csv')
order_main = pd.read_csv('OrderDataForNTU.csv')
shape attribute to check layout¶print(behavior.shape)
print(member.shape)
print(order_main.shape)
print(order_slave.shape)
(15081048, 15) (159835, 11) (508083, 18) (1159830, 15)
head method for a peek¶behavior.head()
| VisitorId | OperationSystem | SessionNumber | TrafficSourceCategory | HitDateTime | BehaviorType | SourceType | OnlineMemberId | CategoryId | SearchKeyWord | TransactionNum | ProductPrice | ProductQuantity | ProductId | TransactionRevenue | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | d960a83d-f413-4659-813c-6cd1610a1292 | iOS | 3.0 | Direct | 2018-05-16 23:14:31.500 | Purchase | APP | 2783844908 | NaN | NaN | 1e90QjPFvmpWFROtMl0nBA9w3dD= | 1380.0 | 1.0 | 4298287.0 | 4330.0 |
| 1 | 43a28e13-df38-45a3-6038-f676b0ac61ba | Windows | 1.0 | Direct | 2018-12-27 20:52:30.428 | Purchase | WEB | 3583134573 | NaN | NaN | RRL2Ah/8qPfZDdARqQ/cT2gSO0L= | 1447.0 | 1.0 | 4899494.0 | 4620.0 |
| 2 | cae70e93-fec0-4c52-38a5-da5c0c77e976 | Android | 1.0 | Direct | 2018-06-07 06:17:15.701 | Purchase | WEB | 3384591959 | NaN | NaN | wdLh1oFx6lNjV1ENQtBVFglDVcD= | 1280.0 | 1.0 | 4305841.0 | 2460.0 |
| 3 | e903d188-cf9d-423a-b058-93d0c882ba2d | Android | 1.0 | 2018-05-09 12:06:41.670 | Purchase | WEB | 47808784R2 | NaN | NaN | csoMr0O2AdKUG2M3k9m5HjwymtZ= | 1280.0 | 1.0 | 4259098.0 | 2860.0 | |
| 4 | 05d9c7be-ad65-48d6-17c7-2d2ddb14ab37 | Android | 1.0 | 2018-03-12 07:34:54.457 | Purchase | WEB | 3483273161 | NaN | NaN | A75CPoWttFCvO3/M5NQxH51LLtb= | 690.0 | 1.0 | 4167807.0 | 2170.0 |
member.head()
| UUID | OnlineMemberId | RegisterSourceTypeDef | RegisterDate | Gender | Birthday | IsAppInstalled | IsEnableEmail | IsEnableShortMessage | IsEnablePushNotification | MemberCardLevel | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | X03665298 | NaN | Store | 2011/12/4 | Female | 1965/1/10 | False | True | True | True | 10 |
| 1 | T04434917 | NaN | Store | 2011/12/19 | Female | 1900/1/1 | False | True | True | True | 10 |
| 2 | T03441135 | NaN | Store | 2011/12/19 | Female | 1983/1/18 | False | True | True | True | 20 |
| 3 | W03686173 | NaN | Store | 2011/12/28 | Female | 1980/1/1 | False | True | True | True | 10 |
| 4 | W03441148 | NaN | Store | 2011/12/19 | Female | 1900/1/1 | False | True | True | True | 10 |
order_main.head()
| UUID | TransactionNum | TradesDate | ChannelType | ChannelDetail | PaymentType | ShippingType | BeforeMemberTierLevel | AfterMemberTierLevel | TsCount | Qty | TotalSalesAmount | TotalPrice | TotalDiscount | TotalPromotionDiscount | TotalCouponDiscount | TotalLoyaltyDiscount | Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | U2131156 | 84GbV1FZLS+u1nu1FgX7AxmHsab= | 2018/7/10 | Pos | A039 | Cash | Store | 0 | 0 | 1 | 1 | 2380 | 2380 | 0 | 0 | 0 | 0 | Finish |
| 1 | V02663974 | g3sJev+nMFHCc7J+3MIkXaeA3wV= | 2018/7/3 | Pos | A042 | Cash | Store | 0 | 0 | 1 | 1 | 390 | 390 | 0 | 0 | 0 | 0 | Finish |
| 2 | R03555577 | jajyDU5zj3+0NFjBDzIPiijAdDD= | 2018/7/4 | Pos | A050 | Cash | Store | 0 | 0 | 1 | 1 | 1480 | 1480 | 0 | 0 | 0 | 0 | Finish |
| 3 | O07191104 | W5qdVNdgnsWEx8a7ikzHw/KbGI6= | 2018/7/7 | Pos | A051 | Cash | Store | 0 | 0 | 1 | 1 | 1380 | 1380 | 0 | 0 | 0 | 0 | Finish |
| 4 | U99462295 | zVBGf/pLjTKHQFQUEQ5zUTvS8Yp= | 2018/7/8 | Pos | A054 | Cash | Store | 0 | 0 | 1 | 1 | 1280 | 1280 | 0 | 0 | 0 | 0 | Finish |
order_slave.head()
| UUID | TransactionNum | TradesDateTime | ChannelType | ChannelDetail | PaymentType | ShippingType | OuterProductSkuCode | ProductSkuCode | SalePageCode | Qty | TotalSalesAmount | TotalPrice | TotalDiscount | Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T04536285 | o2MqkS6y1AR0o4shLXAyZKye9Ct= | 2016-04-01 18:07:20 | Pos | C030 | NaN | NaN | 746140050903 | NaN | NaN | 1 | 0 | 0 | 0 | Finish |
| 1 | T04536285 | o2MqkS6y1AR0o4shLXAyZKye9Ct= | 2016-04-01 18:07:20 | Pos | C030 | NaN | NaN | 746140050902 | NaN | NaN | -1 | 0 | 0 | 0 | Return |
| 2 | P1674651 | gfsCr+82rsF9Qm1sU5KtXiKwlnN= | 2016-04-03 16:10:27 | Pos | A057 | NaN | NaN | 935200275002 | NaN | NaN | 1 | 0 | 0 | 0 | Finish |
| 3 | P1674651 | gfsCr+82rsF9Qm1sU5KtXiKwlnN= | 2016-04-03 16:10:27 | Pos | A057 | NaN | NaN | 935200275001 | NaN | NaN | -1 | 0 | 0 | 0 | Return |
| 4 | N00537019 | wEHcPVo+2/Jzi1Lj5RfCeG5JD+6= | 2016-04-03 18:23:45 | Pos | A057 | NaN | NaN | 116110010002 | NaN | NaN | -1 | 0 | 0 | 0 | Return |
BehaviorDataForNTU.csv¶conn = sqlite3.connect('ntu_dac_91APP.db')
query_str = """
SELECT *
FROM order_slave
LIMIT 5;
"""
pd.read_sql(query_str, conn)
| UUID | TransactionNum | TradesDateTime | ChannelType | ChannelDetail | PaymentType | ShippingType | OuterProductSkuCode | ProductSkuCode | SalePageCode | Qty | TotalSalesAmount | TotalPrice | TotalDiscount | Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T04536285 | o2MqkS6y1AR0o4shLXAyZKye9Ct= | 2016-04-01 18:07:20 | Pos | C030 | None | None | 7.461401e+11 | None | None | 1 | 0 | 0 | 0 | Finish |
| 1 | T04536285 | o2MqkS6y1AR0o4shLXAyZKye9Ct= | 2016-04-01 18:07:20 | Pos | C030 | None | None | 7.461401e+11 | None | None | -1 | 0 | 0 | 0 | Return |
| 2 | P1674651 | gfsCr+82rsF9Qm1sU5KtXiKwlnN= | 2016-04-03 16:10:27 | Pos | A057 | None | None | 9.352003e+11 | None | None | 1 | 0 | 0 | 0 | Finish |
| 3 | P1674651 | gfsCr+82rsF9Qm1sU5KtXiKwlnN= | 2016-04-03 16:10:27 | Pos | A057 | None | None | 9.352003e+11 | None | None | -1 | 0 | 0 | 0 | Return |
| 4 | N00537019 | wEHcPVo+2/Jzi1Lj5RfCeG5JD+6= | 2016-04-03 18:23:45 | Pos | A057 | None | None | 1.161100e+11 | None | None | -1 | 0 | 0 | 0 | Return |
DataFrame object¶shapeinfo()head()/tail()describe()order_main.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 508083 entries, 0 to 508082 Data columns (total 18 columns): UUID 508083 non-null object TransactionNum 508083 non-null object TradesDate 508083 non-null object ChannelType 508083 non-null object ChannelDetail 508083 non-null object PaymentType 508083 non-null object ShippingType 508083 non-null object BeforeMemberTierLevel 508083 non-null int64 AfterMemberTierLevel 508083 non-null int64 TsCount 508083 non-null int64 Qty 508083 non-null int64 TotalSalesAmount 508083 non-null int64 TotalPrice 508083 non-null int64 TotalDiscount 508083 non-null int64 TotalPromotionDiscount 508083 non-null int64 TotalCouponDiscount 508083 non-null int64 TotalLoyaltyDiscount 508083 non-null int64 Status 508083 non-null object dtypes: int64(10), object(8) memory usage: 69.8+ MB
order_main.describe()
| BeforeMemberTierLevel | AfterMemberTierLevel | TsCount | Qty | TotalSalesAmount | TotalPrice | TotalDiscount | TotalPromotionDiscount | TotalCouponDiscount | TotalLoyaltyDiscount | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 508083.0 | 508083.0 | 508083.000000 | 508083.000000 | 508083.000000 | 508083.000000 | 508083.000000 | 508083.000000 | 508083.000000 | 508083.0 |
| mean | 0.0 | 0.0 | 2.247958 | 1.898645 | 2642.526402 | 2693.586961 | -51.060559 | -3.998362 | -47.062197 | 0.0 |
| std | 0.0 | 0.0 | 1.959992 | 2.300044 | 3488.535635 | 3502.082785 | 178.671054 | 61.354120 | 168.924130 | 0.0 |
| min | 0.0 | 0.0 | 1.000000 | -36.000000 | -53790.000000 | -53790.000000 | -9100.000000 | -9100.000000 | -5000.000000 | 0.0 |
| 25% | 0.0 | 0.0 | 1.000000 | 1.000000 | 1280.000000 | 1280.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| 50% | 0.0 | 0.0 | 2.000000 | 1.000000 | 1980.000000 | 2070.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| 75% | 0.0 | 0.0 | 3.000000 | 3.000000 | 3460.000000 | 3560.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 |
| max | 0.0 | 0.0 | 162.000000 | 162.000000 | 255960.000000 | 255960.000000 | 1000.000000 | 0.000000 | 1000.000000 | 0.0 |
DataFrame operations¶DataFrame¶cols_to_select = ['RegisterDate', 'Gender', 'Birthday']
member[cols_to_select]
| RegisterDate | Gender | Birthday | |
|---|---|---|---|
| 0 | 2011/12/4 | Female | 1965/1/10 |
| 1 | 2011/12/19 | Female | 1900/1/1 |
| 2 | 2011/12/19 | Female | 1983/1/18 |
| 3 | 2011/12/28 | Female | 1980/1/1 |
| 4 | 2011/12/19 | Female | 1900/1/1 |
| ... | ... | ... | ... |
| 159830 | 2018/12/19 | Female | 1966/1/23 |
| 159831 | 2018/12/18 | Female | 1963/12/18 |
| 159832 | 2018/12/16 | Female | 1966/11/7 |
| 159833 | 2018/12/25 | Female | 1986/6/15 |
| 159834 | 2018/12/29 | Female | 1959/3/26 |
159835 rows × 3 columns
DataFrame¶rows_to_select = member['Gender'] == 'Male'
member[rows_to_select]
| UUID | OnlineMemberId | RegisterSourceTypeDef | RegisterDate | Gender | Birthday | IsAppInstalled | IsEnableEmail | IsEnableShortMessage | IsEnablePushNotification | MemberCardLevel | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 53 | S04520222 | NaN | Store | 2011/12/25 | Male | 1982/4/11 | False | True | True | True | 10 |
| 108 | U03583772 | NaN | Store | 2012/1/1 | Male | 1981/10/16 | False | True | True | True | 10 |
| 109 | R03583770 | NaN | Store | 2011/12/18 | Male | 1982/10/14 | False | True | True | True | 10 |
| 117 | W04520226 | NaN | Store | 2011/12/25 | Male | 1980/11/16 | False | True | True | True | 20 |
| 120 | P03583767 | NaN | Store | 2011/12/10 | Male | 1984/11/23 | False | True | True | True | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 159198 | T96759858 | 45844100P0 | LocationWizard | 2018/12/27 | Male | 1974/3/29 | True | True | True | True | 10 |
| 159578 | T96399026 | 3885746594 | LocationWizard | 2018/12/11 | Male | 1976/9/25 | True | True | True | True | 10 |
| 159655 | S96478806 | 3184873530 | LocationWizard | 2018/12/16 | Male | 1992/12/16 | True | False | False | False | 10 |
| 159739 | O96696810 | 3185458431 | LocationWizard | 2018/12/22 | Male | 1971/11/8 | False | True | True | True | 10 |
| 159742 | N96481013 | 3285768749 | LocationWizard | 2018/12/13 | Male | 1960/8/20 | True | True | True | True | 10 |
1927 rows × 11 columns
DataFrame¶cols_to_select = ['RegisterDate', 'Gender', 'Birthday']
rows_to_select = member['Gender'] == 'Male'
member[rows_to_select][cols_to_select]
| RegisterDate | Gender | Birthday | |
|---|---|---|---|
| 53 | 2011/12/25 | Male | 1982/4/11 |
| 108 | 2012/1/1 | Male | 1981/10/16 |
| 109 | 2011/12/18 | Male | 1982/10/14 |
| 117 | 2011/12/25 | Male | 1980/11/16 |
| 120 | 2011/12/10 | Male | 1984/11/23 |
| ... | ... | ... | ... |
| 159198 | 2018/12/27 | Male | 1974/3/29 |
| 159578 | 2018/12/11 | Male | 1976/9/25 |
| 159655 | 2018/12/16 | Male | 1992/12/16 |
| 159739 | 2018/12/22 | Male | 1971/11/8 |
| 159742 | 2018/12/13 | Male | 1960/8/20 |
1927 rows × 3 columns
DataFrame¶loc[]: indexing with Indexiloc[]: indexing with absolute positionsmale_members = member[rows_to_select]
male_members.loc[[53, 108, 109, 117, 120]]
| UUID | OnlineMemberId | RegisterSourceTypeDef | RegisterDate | Gender | Birthday | IsAppInstalled | IsEnableEmail | IsEnableShortMessage | IsEnablePushNotification | MemberCardLevel | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 53 | S04520222 | NaN | Store | 2011/12/25 | Male | 1982/4/11 | False | True | True | True | 10 |
| 108 | U03583772 | NaN | Store | 2012/1/1 | Male | 1981/10/16 | False | True | True | True | 10 |
| 109 | R03583770 | NaN | Store | 2011/12/18 | Male | 1982/10/14 | False | True | True | True | 10 |
| 117 | W04520226 | NaN | Store | 2011/12/25 | Male | 1980/11/16 | False | True | True | True | 20 |
| 120 | P03583767 | NaN | Store | 2011/12/10 | Male | 1984/11/23 | False | True | True | True | 10 |
male_members = member[rows_to_select]
male_members.iloc[[0, 1, 2, 3, 4]]
| UUID | OnlineMemberId | RegisterSourceTypeDef | RegisterDate | Gender | Birthday | IsAppInstalled | IsEnableEmail | IsEnableShortMessage | IsEnablePushNotification | MemberCardLevel | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 53 | S04520222 | NaN | Store | 2011/12/25 | Male | 1982/4/11 | False | True | True | True | 10 |
| 108 | U03583772 | NaN | Store | 2012/1/1 | Male | 1981/10/16 | False | True | True | True | 10 |
| 109 | R03583770 | NaN | Store | 2011/12/18 | Male | 1982/10/14 | False | True | True | True | 10 |
| 117 | W04520226 | NaN | Store | 2011/12/25 | Male | 1980/11/16 | False | True | True | True | 20 |
| 120 | P03583767 | NaN | Store | 2011/12/10 | Male | 1984/11/23 | False | True | True | True | 10 |
DataFrame¶sort_values method: sorting with a specific columnsort_index method: sorting with the Index of DataFrameorder_main.sort_values('TotalSalesAmount')
| UUID | TransactionNum | TradesDate | ChannelType | ChannelDetail | PaymentType | ShippingType | BeforeMemberTierLevel | AfterMemberTierLevel | TsCount | Qty | TotalSalesAmount | TotalPrice | TotalDiscount | TotalPromotionDiscount | TotalCouponDiscount | TotalLoyaltyDiscount | Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 485076 | X93677453 | sVy7Na3OtozUpbF36AULY2AtQHN= | 2018/11/15 | Pos | C040 | Cash | Store | 0 | 0 | 21 | -21 | -53790 | -53790 | 0 | 0 | 0 | 0 | Return |
| 319578 | W07477115 | LygZlpowSkpJ75b+yz8kh2FSj5H= | 2018/10/20 | Pos | A062 | Cash | Store | 0 | 0 | 17 | -17 | -53480 | -53480 | 0 | 0 | 0 | 0 | Return |
| 391437 | W05026959 | Nwtrv1T4K2fMSDhUyEEwICixRhN= | 2018/1/27 | Pos | C016 | Cash | Store | 0 | 0 | 2 | -2 | -48600 | -48600 | 0 | 0 | 0 | 0 | Return |
| 26575 | Q04238600 | aIt3Om1iHgkaEYIyW3PqppC+6Oj= | 2018/12/25 | Pos | A040 | Cash | Store | 0 | 0 | 7 | -7 | -45080 | -45080 | 0 | 0 | 0 | 0 | Return |
| 467183 | X94626005 | 0IPFjgAPUpaMKon2xaa13Wojcw6= | 2019/1/14 | Pos | A062 | Cash | Store | 0 | 0 | 12 | -12 | -39590 | -39590 | 0 | 0 | 0 | 0 | Return |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 467167 | W09391463 | jWG1/FVzp3Mxf0q4QtcXgbNV4rb= | 2019/1/3 | Pos | B055 | Cash | Store | 0 | 0 | 27 | 27 | 93160 | 93160 | 0 | 0 | 0 | 0 | Finish |
| 467081 | W7108275 | JlNXzMm8L4rju/V9iLB0WYYlb76= | 2019/1/29 | Pos | A015 | Cash | Store | 0 | 0 | 14 | 14 | 99920 | 99920 | 0 | 0 | 0 | 0 | Finish |
| 485054 | W05580065 | OvWmBv66dTEPr2UdrUkudTdyY+Z= | 2018/11/20 | Pos | A015 | Cash | Store | 0 | 0 | 19 | 19 | 118520 | 118520 | 0 | 0 | 0 | 0 | Finish |
| 391324 | W07813704 | LPKULMkVNK+ptb4he4ApOcvjV1R= | 2018/1/17 | Pos | A054 | Cash | Store | 0 | 0 | 85 | 85 | 159740 | 159840 | -100 | 0 | -100 | 0 | Finish |
| 413632 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019/4/14 | Pos | C019 | Cash | Store | 0 | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
508083 rows × 18 columns
male_members.sort_index(ascending=False)
| UUID | OnlineMemberId | RegisterSourceTypeDef | RegisterDate | Gender | Birthday | IsAppInstalled | IsEnableEmail | IsEnableShortMessage | IsEnablePushNotification | MemberCardLevel | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 159742 | N96481013 | 3285768749 | LocationWizard | 2018/12/13 | Male | 1960/8/20 | True | True | True | True | 10 |
| 159739 | O96696810 | 3185458431 | LocationWizard | 2018/12/22 | Male | 1971/11/8 | False | True | True | True | 10 |
| 159655 | S96478806 | 3184873530 | LocationWizard | 2018/12/16 | Male | 1992/12/16 | True | False | False | False | 10 |
| 159578 | T96399026 | 3885746594 | LocationWizard | 2018/12/11 | Male | 1976/9/25 | True | True | True | True | 10 |
| 159198 | T96759858 | 45844100P0 | LocationWizard | 2018/12/27 | Male | 1974/3/29 | True | True | True | True | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 120 | P03583767 | NaN | Store | 2011/12/10 | Male | 1984/11/23 | False | True | True | True | 10 |
| 117 | W04520226 | NaN | Store | 2011/12/25 | Male | 1980/11/16 | False | True | True | True | 20 |
| 109 | R03583770 | NaN | Store | 2011/12/18 | Male | 1982/10/14 | False | True | True | True | 10 |
| 108 | U03583772 | NaN | Store | 2012/1/1 | Male | 1981/10/16 | False | True | True | True | 10 |
| 53 | S04520222 | NaN | Store | 2011/12/25 | Male | 1982/4/11 | False | True | True | True | 10 |
1927 rows × 11 columns
DataFrame¶map with a dictmap with a function(or a lambda expression)member['RegisterSourceTypeDef'].unique()
array(['Store', 'iOSApp', 'LocationWizard', 'Web', 'AndroidApp', nan],
dtype=object)
register_source_dict = {
'Store': 'Non mobile app',
'LocationWizard': 'Non mobile app',
'Web': 'Non mobile app',
'AndroidApp': 'Mobile app',
'iOSApp': 'Mobile app',
np.NaN: 'Non mobile app'
}
member['RegisterSourceTypeDef'].map(register_source_dict).value_counts()
Non mobile app 129817 Mobile app 30018 Name: RegisterSourceTypeDef, dtype: int64
def register_source_func(x):
if x in ['AndroidApp', 'iOSApp']:
return 'Mobile app'
else:
return 'Non mobile app'
member['RegisterSourceTypeDef'].map(register_source_func).value_counts()
Non mobile app 129817 Mobile app 30018 Name: RegisterSourceTypeDef, dtype: int64
DataFrame with built-in methods¶order_main['TotalSalesAmount'].sum()
1342622742
DataFrame with built-in methods¶order_main.groupby('PaymentType')['TotalSalesAmount'].sum()
PaymentType ATM 91308 Cash 1093109830 CreditCardOnce 80850669 Family 75396620 LinePay 6214910 SevenEleven 86959405 Name: TotalSalesAmount, dtype: int64
DataFrame operations¶datetime typedatetime type¶print(behavior['HitDateTime'][:100])
0 2018-05-16 23:14:31.500
1 2018-12-27 20:52:30.428
2 2018-06-07 06:17:15.701
3 2018-05-09 12:06:41.670
4 2018-03-12 07:34:54.457
...
95 2018-11-27 20:07:52.028
96 2018-07-16 23:21:10.482
97 2018-07-17 23:15:20.457
98 2018-07-17 21:44:05.717
99 2018-06-12 08:27:47.882
Name: HitDateTime, Length: 100, dtype: object
print(pd.to_datetime(behavior['HitDateTime'][:100]))
0 2018-05-16 23:14:31.500
1 2018-12-27 20:52:30.428
2 2018-06-07 06:17:15.701
3 2018-05-09 12:06:41.670
4 2018-03-12 07:34:54.457
...
95 2018-11-27 20:07:52.028
96 2018-07-16 23:21:10.482
97 2018-07-17 23:15:20.457
98 2018-07-17 21:44:05.717
99 2018-06-12 08:27:47.882
Name: HitDateTime, Length: 100, dtype: datetime64[ns]
print(member['RegisterDate'])
0 2011/12/4
1 2011/12/19
2 2011/12/19
3 2011/12/28
4 2011/12/19
...
159830 2018/12/19
159831 2018/12/18
159832 2018/12/16
159833 2018/12/25
159834 2018/12/29
Name: RegisterDate, Length: 159835, dtype: object
print(pd.to_datetime(member['RegisterDate']))
0 2011-12-04
1 2011-12-19
2 2011-12-19
3 2011-12-28
4 2011-12-19
...
159830 2018-12-19
159831 2018-12-18
159832 2018-12-16
159833 2018-12-25
159834 2018-12-29
Name: RegisterDate, Length: 159835, dtype: datetime64[ns]
isnull and notnull to check if np.NaN existsdropna to drop rows with np.NaNfillna to fill np.NaN with specific valuesprint(member.shape[0])
print("===")
print(member['RegisterSourceTypeDef'].isnull().sum())
print(member['RegisterSourceTypeDef'].notnull().sum())
print("===")
print(member.dropna().shape[0])
print("===")
print(member['RegisterSourceTypeDef'].fillna('Unknown').isnull().sum())
print(member['RegisterSourceTypeDef'].fillna('Unknown').notnull().sum())
159835 === 14 159821 === 94067 === 0 159835
merge function on columnjoin method on index# Retrieve the order_main of max TotalSalesAmount
max_ttl_sales_amt = order_main['TotalSalesAmount'].max()
max_order_main = order_main[order_main.TotalSalesAmount == max_ttl_sales_amt]
max_order_main
| UUID | TransactionNum | TradesDate | ChannelType | ChannelDetail | PaymentType | ShippingType | BeforeMemberTierLevel | AfterMemberTierLevel | TsCount | Qty | TotalSalesAmount | TotalPrice | TotalDiscount | TotalPromotionDiscount | TotalCouponDiscount | TotalLoyaltyDiscount | Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 413632 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019/4/14 | Pos | C019 | Cash | Store | 0 | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
# Retrieve the order_slave of max TotalSalesAmount
max_transaction_num = max_order_main['TransactionNum'].values[0]
max_order_slave = order_slave[order_slave['TransactionNum'] == max_transaction_num]
max_order_slave
| UUID | TransactionNum | TradesDateTime | ChannelType | ChannelDetail | PaymentType | ShippingType | OuterProductSkuCode | ProductSkuCode | SalePageCode | Qty | TotalSalesAmount | TotalPrice | TotalDiscount | Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1087514 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990901 | NaN | NaN | 1 | 255960 | 255960 | 0 | Finish |
| 1087515 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990901 | NaN | NaN | 1 | 255960 | 255960 | 0 | Finish |
| 1087516 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990901 | NaN | NaN | 1 | 255960 | 255960 | 0 | Finish |
| 1087517 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990901 | NaN | NaN | 1 | 255960 | 255960 | 0 | Finish |
| 1087519 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990901 | NaN | NaN | 1 | 255960 | 255960 | 0 | Finish |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1087679 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990907 | NaN | NaN | 1 | 255960 | 255960 | 0 | Finish |
| 1087680 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990907 | NaN | NaN | 1 | 255960 | 255960 | 0 | Finish |
| 1087681 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990907 | NaN | NaN | 1 | 255960 | 255960 | 0 | Finish |
| 1087682 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990907 | NaN | NaN | 1 | 255960 | 255960 | 0 | Finish |
| 1087683 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990907 | NaN | NaN | 1 | 255960 | 255960 | 0 | Finish |
162 rows × 15 columns
UUID or TransactionNum as our foreign keys¶pd.merge(max_order_slave, max_order_main, left_on='UUID', right_on='UUID', suffixes=('_slave', '_main'))
| UUID | TransactionNum_slave | TradesDateTime | ChannelType_slave | ChannelDetail_slave | PaymentType_slave | ShippingType_slave | OuterProductSkuCode | ProductSkuCode | SalePageCode | ... | AfterMemberTierLevel | TsCount | Qty_main | TotalSalesAmount_main | TotalPrice_main | TotalDiscount_main | TotalPromotionDiscount | TotalCouponDiscount | TotalLoyaltyDiscount | Status_main | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990901 | NaN | NaN | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| 1 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990901 | NaN | NaN | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| 2 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990901 | NaN | NaN | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| 3 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990901 | NaN | NaN | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| 4 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990901 | NaN | NaN | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 157 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990907 | NaN | NaN | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| 158 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990907 | NaN | NaN | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| 159 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990907 | NaN | NaN | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| 160 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990907 | NaN | NaN | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| 161 | R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990907 | NaN | NaN | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
162 rows × 32 columns
join method, set FK to index first¶max_order_slave = max_order_slave.set_index('UUID')
max_order_main = max_order_main.set_index('UUID')
max_order_slave.join(max_order_main, lsuffix='_slave', rsuffix='_main')
| TransactionNum_slave | TradesDateTime | ChannelType_slave | ChannelDetail_slave | PaymentType_slave | ShippingType_slave | OuterProductSkuCode | ProductSkuCode | SalePageCode | Qty_slave | ... | AfterMemberTierLevel | TsCount | Qty_main | TotalSalesAmount_main | TotalPrice_main | TotalDiscount_main | TotalPromotionDiscount | TotalCouponDiscount | TotalLoyaltyDiscount | Status_main | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UUID | |||||||||||||||||||||
| R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990901 | NaN | NaN | 1 | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990901 | NaN | NaN | 1 | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990901 | NaN | NaN | 1 | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990901 | NaN | NaN | 1 | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990901 | NaN | NaN | 1 | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990907 | NaN | NaN | 1 | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990907 | NaN | NaN | 1 | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990907 | NaN | NaN | 1 | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990907 | NaN | NaN | 1 | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
| R04234145 | zK6XGNghLYS3ZuQloS4ocxEXeW8= | 2019-04-14 17:58:33 | Pos | C019 | NaN | NaN | 509120990907 | NaN | NaN | 1 | ... | 0 | 162 | 162 | 255960 | 255960 | 0 | 0 | 0 | 0 | Finish |
162 rows × 31 columns
np.nan with order_slave['PaymentType'] and order_slave['ShippingType'] for the max TotalSalesAmount order(TransactionNum: zK6XGNghLYS3ZuQloS4ocxEXeW8=)# Import 4 91APP files into your own Python environment
# Refer to previous slides
# Fill np.nan with order_slave['PaymentType'] and order_slave['ShippingType'] for the max TotalSalesAmount order(TransactionNum: zK6XGNghLYS3ZuQloS4ocxEXeW8=)
# Retrieve the order_main of max TotalSalesAmount
max_ttl_sales_amt = order_main['TotalSalesAmount'].max()
max_order_main = order_main[order_main.TotalSalesAmount == max_ttl_sales_amt]
# Retrieve the order_slave of max TotalSalesAmount
max_transaction_num = max_order_main['TransactionNum'].values[0]
max_order_slave = order_slave[order_slave['TransactionNum'] == max_transaction_num]
# Retrieve imputed values
impute_payment_type = max_order_main['PaymentType'].values[0]
impute_shipping_type = max_order_main['ShippingType'].values[0]
# Before fillna
print(max_order_slave['PaymentType'])
print(max_order_slave['ShippingType'])
1087514 NaN
1087515 NaN
1087516 NaN
1087517 NaN
1087519 NaN
...
1087679 NaN
1087680 NaN
1087681 NaN
1087682 NaN
1087683 NaN
Name: PaymentType, Length: 162, dtype: object
1087514 NaN
1087515 NaN
1087516 NaN
1087517 NaN
1087519 NaN
...
1087679 NaN
1087680 NaN
1087681 NaN
1087682 NaN
1087683 NaN
Name: ShippingType, Length: 162, dtype: object
# After fillna
print(max_order_slave['PaymentType'].fillna(impute_payment_type))
print(max_order_slave['ShippingType'].fillna(impute_shipping_type))
1087514 Cash
1087515 Cash
1087516 Cash
1087517 Cash
1087519 Cash
...
1087679 Cash
1087680 Cash
1087681 Cash
1087682 Cash
1087683 Cash
Name: PaymentType, Length: 162, dtype: object
1087514 Store
1087515 Store
1087516 Store
1087517 Store
1087519 Store
...
1087679 Store
1087680 Store
1087681 Store
1087682 Store
1087683 Store
Name: ShippingType, Length: 162, dtype: object
We briefly introduce visualization and how to use popular packages for visualization in Python with 91APP dataset.
Visualization is a tool that addresses the gap between data, function, formula, and viewers. Effective visualizations transform abstract collections of numbers into shapes that viewers quickly grasp and understand.
what are $x$ and $f$?
print(x)
print(f)
[ 0. 0.25645654 0.51291309 0.76936963 1.02582617 1.28228272 1.53873926 1.7951958 2.05165235 2.30810889 2.56456543 2.82102197 3.07747852 3.33393506 3.5903916 3.84684815 4.10330469 4.35976123 4.61621778 4.87267432 5.12913086 5.38558741 5.64204395 5.89850049 6.15495704 6.41141358 6.66787012 6.92432667 7.18078321 7.43723975 7.69369629 7.95015284 8.20660938 8.46306592 8.71952247 8.97597901 9.23243555 9.4888921 9.74534864 10.00180518 10.25826173 10.51471827 10.77117481 11.02763136 11.2840879 11.54054444 11.79700098 12.05345753 12.30991407 12.56637061] [ 0.00000000e+00 2.53654584e-01 4.90717552e-01 6.95682551e-01 8.55142763e-01 9.58667853e-01 9.99486216e-01 9.74927912e-01 8.86599306e-01 7.40277997e-01 5.45534901e-01 3.15108218e-01 6.40702200e-02 -1.91158629e-01 -4.33883739e-01 -6.48228395e-01 -8.20172255e-01 -9.38468422e-01 -9.95379113e-01 -9.87181783e-01 -9.14412623e-01 -7.81831482e-01 -5.98110530e-01 -3.75267005e-01 -1.27877162e-01 1.27877162e-01 3.75267005e-01 5.98110530e-01 7.81831482e-01 9.14412623e-01 9.87181783e-01 9.95379113e-01 9.38468422e-01 8.20172255e-01 6.48228395e-01 4.33883739e-01 1.91158629e-01 -6.40702200e-02 -3.15108218e-01 -5.45534901e-01 -7.40277997e-01 -8.86599306e-01 -9.74927912e-01 -9.99486216e-01 -9.58667853e-01 -8.55142763e-01 -6.95682551e-01 -4.90717552e-01 -2.53654584e-01 -4.89858720e-16]
plot_sinx(x, f)
plot_sigmoid(x, S)
plot_logistic_cost_function(x)

Source: Coursera | Machine Learning

Source: Coursera | Machine Learning
plot_multiple_decision_boundary()
In 1812, Napoleon marched to Moscow. 98% of his soldiers died. The simple but fascinating temperature line below the viz shows how cold ultimately defeated Napoleon’s army.

A 2007 TED talk for the Swedish scientist shared his passion with the world. It shows the relationship between wealth and health.

pandasseabornmatplotlibplotlyDataFrame or Seriespandasseabornlist, ndarray, or Seriesmatplotlibplotlyimport pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
order_slave_dtypes = {
'PaymentType': 'str',
'ShippingType': 'str'
}
behavior_dtypes = {
'CategoryId': 'str',
'SearchKeyWord': 'str',
'TransactionNum': 'str'
}
member = pd.read_csv('MemberDataForNTU.csv')
order_main = pd.read_csv('OrderDataForNTU.csv')
order_slave = pd.read_csv('OrderSlaveDataForNTU.txt', dtype=order_slave_dtypes)
behavior = pd.read_csv('BehaviorDataForNTU.csv', dtype=behavior_dtypes)
pandas¶DataFrame class has a .plot() method with several key parameters.
pandas¶order_main.groupby('PaymentType')['TotalSalesAmount'].sum().sort_values()
PaymentType ATM 91308 LinePay 6214910 Family 75396620 CreditCardOnce 80850669 SevenEleven 86959405 Cash 1093109830 Name: TotalSalesAmount, dtype: int64
pandas following groupby summary¶sale_amount_by_payment_type = order_main.groupby('PaymentType')['TotalSalesAmount'].sum().sort_values()
sale_amount_by_payment_type.plot(kind='barh')
plt.show()
pandas¶order_main['TradesDate'] = pd.to_datetime(order_main['TradesDate'])
order_main.groupby('TradesDate')['TotalSalesAmount'].sum()
TradesDate
2016-01-01 920950
2016-01-02 966747
2016-01-03 704524
2016-01-04 660950
2016-01-05 717040
...
2019-04-27 1575247
2019-04-28 1285845
2019-04-29 1292115
2019-04-30 2225314
2019-05-01 26440
Name: TotalSalesAmount, Length: 1217, dtype: int64
sale_amount_by_trades_date = order_main.groupby('TradesDate')['TotalSalesAmount'].sum()
sale_amount_by_trades_date.plot(kind='line')
plt.show()
seaborn¶Using different plotting functions
barplot function for a barlineplot function for a lineDataFrame with pandas¶sale_amount_by_payment_type_df = pd.DataFrame(sale_amount_by_payment_type).reset_index().sort_values('TotalSalesAmount', ascending=False)
sale_amount_by_payment_type_df
| PaymentType | TotalSalesAmount | |
|---|---|---|
| 5 | Cash | 1093109830 |
| 4 | SevenEleven | 86959405 |
| 3 | CreditCardOnce | 80850669 |
| 2 | Family | 75396620 |
| 1 | LinePay | 6214910 |
| 0 | ATM | 91308 |
barplot function for a bar¶sns.barplot(x='TotalSalesAmount', y='PaymentType', data=sale_amount_by_payment_type_df)
plt.show()
DataFrame with pandas¶sale_amount_by_trades_date_shipping_type = order_main.groupby(['TradesDate', 'ShippingType'])['TotalSalesAmount'].sum()
sale_amount_by_trades_date_shipping_type_df = pd.DataFrame(sale_amount_by_trades_date_shipping_type).reset_index()
sale_amount_by_trades_date_shipping_type_df
| TradesDate | ShippingType | TotalSalesAmount | |
|---|---|---|---|
| 0 | 2016-01-01 | Family | 3740 |
| 1 | 2016-01-01 | Home | 18370 |
| 2 | 2016-01-01 | SevenEleven | 18780 |
| 3 | 2016-01-01 | Store | 880060 |
| 4 | 2016-01-02 | Family | 18260 |
| ... | ... | ... | ... |
| 7147 | 2019-04-30 | Store | 1682940 |
| 7148 | 2019-05-01 | Family | 4830 |
| 7149 | 2019-05-01 | Home | 16680 |
| 7150 | 2019-05-01 | LocationPickup | 2660 |
| 7151 | 2019-05-01 | SevenEleven | 2270 |
7152 rows × 3 columns
lineplot function for a line¶sns.lineplot(x='TradesDate', y='TotalSalesAmount', hue='ShippingType', data=sale_amount_by_trades_date_shipping_type_df)
plt.show()
matplotlib¶list, ndarray or Seriesfigure function for a "figure" objectaxes function for an "axis" objectshow function for displayfig = plt.figure()
axes = plt.axes()
fig = plt.figure()
axes = plt.axes()
axes.barh(sale_amount_by_payment_type.index, sale_amount_by_payment_type.values)
plt.show()
fig = plt.figure()
axes = plt.axes()
fig = plt.figure()
axes = plt.axes()
x = sale_amount_by_trades_date.index
y = sale_amount_by_trades_date.values
axes.plot(x, y)
plt.show()
import plotly.graph_objs as go
# Create df
df = pd.DataFrame(sale_amount_by_trades_date).reset_index()
# Create figure
layout = go.Layout(
autosize=False,
width=720,
height=480,
margin=go.layout.Margin(
l=50,
r=50,
b=100,
t=100,
pad=4
),
bargroupgap=0.3
)
fig = go.Figure(layout=layout)
fig.add_trace(
go.Scatter(x=list(df.TradesDate), y=list(df.TotalSalesAmount)))
# Set title
fig.update_layout(
title_text="Time series with range slider and selectors"
)
# Add range slider
fig.update_layout(
xaxis=dict(
rangeselector=dict(
buttons=list([
dict(count=1,
label="1m",
step="month",
stepmode="backward"),
dict(count=6,
label="6m",
step="month",
stepmode="backward"),
dict(count=1,
label="YTD",
step="year",
stepmode="todate"),
dict(count=1,
label="1y",
step="year",
stepmode="backward"),
dict(step="all")
])
),
rangeslider=dict(
visible=True
),
type="date"
)
)
fig.show()